{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _country_userguide:\n",
    "\n",
    "Country Names\n",
    "============="
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "------------\n",
    "\n",
    "The function :func:`clean_country() <dataprep.clean.clean_country.clean_country>` cleans a column containing country names and/or `ISO 3166 <https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes>`_ country codes, and standardizes them in a desired format. The function :func:`validate_country() <dataprep.clean.clean_country.validate_country>` validates either a single country or a column of countries, returning True if the value is valid, and False otherwise. The countries/regions supported and the regular expressions used can be found on `GitHub <https://github.com/sfu-db/dataprep/blob/develop/dataprep/clean/country_data.tsv>`_.\n",
    "\n",
    "Countries can be converted to and from the following formats via the ``input_format`` and ``output_format`` parameters:\n",
    "\n",
    "* Short country name (name): \"United States\"\n",
    "* Official state name (official): \"United States of America\"\n",
    "* ISO 3166-1 alpha-2 (alpha-2): \"US\"\n",
    "* ISO 3166-1 alpha-3 (alpha-3): \"USA\"\n",
    "* ISO 3166-1 numeric (numeric): \"840\"\n",
    " \n",
    "``input_format`` can be set to \"auto\" which automatically infers the input format. A tuple of input formats may also be used to indicate that the input may be any of the given input formats. \n",
    "\n",
    "The ``strict`` parameter allows for control over the type of matching used for the \"name\" and \"official\" input formats.\n",
    "\n",
    "* False (default for ``clean_country()``), search the input for a regex match\n",
    "* True (default for ``validate_country()``), look for a direct match with a country value in the same format\n",
    "\n",
    "The ``fuzzy_dist`` parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. \n",
    "\n",
    "* 0 (default), countries at most 0 edits from matching a regex are successfully cleaned\n",
    "* 1, countries at most 1 edit from matching a regex are successfully cleaned\n",
    "* n, countries at most n edits from matching a regex are successfully cleaned\n",
    "\n",
    "Invalid parsing is handled with the ``errors`` parameter:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception\n",
    "\n",
    "After cleaning, a **report** is printed that provides the following information:\n",
    "\n",
    "* How many values were cleaned (the value must have been transformed).\n",
    "* How many values could not be parsed.\n",
    "* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.\n",
    "  \n",
    "The following sections demonstrate the functionality of ``clean_country()`` and ``validate_country()``. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example dataset with country values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame({\n",
    "    \"country\": [\n",
    "        \"Canada\", \"foo canada bar\", \"cnada\", \"northern ireland\", \" ireland \",\n",
    "        \"congo, kinshasa\", \"congo, brazzaville\", 304, \"233\", \" tr \", \"ARG\",\n",
    "        \"hello\", np.nan, \"NULL\"\n",
    "    ]\n",
    "})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default `clean_country()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the `input_format` parameter is set to \"auto\" (automatically determines the input format), the `output_format` parameter is set to \"name\". The `fuzzy_dist` parameter is set to 0 and `strict` is False. The `errors` parameter is set to \"coerce\" (set NaN when parsing is invalid)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_country\n",
    "clean_country(df, \"country\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note \"Canada\" is considered not cleaned in the report since it's cleaned value is the same as the input. Also, \"northern ireland\" is invalid because it is part of the United Kingdom. Kinshasa and Brazzaville are the capital cities of their respective countries."
   ]
  },
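  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `errors` parameter described in the introduction can be set in the same call. As a minimal sketch, assuming `df` is the example dataset defined above, passing `errors=\"ignore\"` should return values that cannot be parsed unchanged instead of setting them to NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep unparseable inputs (e.g. \"hello\") as they are rather than coercing them to NaN\n",
    "clean_country(df, \"country\", errors=\"ignore\")"
   ]
  },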
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Input formats\n",
    "\n",
    "This section demonstrates the supported country input formats.\n",
    "\n",
    "### name\n",
    "\n",
    "If the input contains a match with one of the country regexes then it is successfully converted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"name\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### official\n",
    "\n",
    "Does the same thing as `input_format=\"name\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"official\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### alpha-2\n",
    "\n",
    "Looks for a direct match with a ISO 3166-1 alpha-2 country code, case insensitive and ignoring leading and trailing whitespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"alpha-2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### alpha-3\n",
    "\n",
    "Looks for a direct match with a ISO 3166-1 alpha-3 country code, case insensitive and ignoring leading and trailing whitespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"alpha-3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### numeric\n",
    "\n",
    "Looks for a direct match with a ISO 3166-1 numeric country code, case insensitive and ignoring leading and trailing whitespace. Works on integers and strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"numeric\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (name, alpha-2)\n",
    "\n",
    "A tuple containing any combination of input formats may be used to clean any of the given input formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=(\"name\", \"alpha-2\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Output formats\n",
    "\n",
    "This section demonstrates the supported output country formats.\n",
    "\n",
    "### official"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", output_format=\"official\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### alpha-2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", output_format=\"alpha-2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### alpha-3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", output_format=\"alpha-3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### numeric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", output_format=\"numeric\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Any combination of input and output formats may be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", input_format=\"alpha-2\", output_format=\"official\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `strict` parameter\n",
    "\n",
    "This parameter allows for control over the type of matching used for \"name\" and \"official\" input formats. When False, the input is searched for a regex match. When True, matching is done by looking for a direct match with a country in the same format. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", strict=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"foo canada bar\", \"congo kinshasa\" and \"congo brazzaville\" are now invalid because they are not a direct match with a country in the \"name\" or \"official\" formats. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Fuzzy Matching\n",
    "\n",
    "The `fuzzy_dist` parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. If an input is successfully cleaned by `clean_country()` with `fuzzy_dist=0` then that input with one character inserted, deleted or substituted will match with `fuzzy_dist=1`. This parameter only applies to the \"name\" and \"official\" input formats."
   ]
  },
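  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the edit-distance definition above concrete, the following is a small, self-contained sketch of Levenshtein distance between two plain strings. It is only an illustration: the helper `edit_distance` is hypothetical and is not part of dataprep, and `clean_country()` measures the distance to a country regex rather than to a fixed string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper for illustration only; not part of dataprep\n",
    "def edit_distance(a, b):\n",
    "    # Minimum number of single-character insertions, deletions or\n",
    "    # substitutions needed to turn string a into string b (Levenshtein)\n",
    "    prev = list(range(len(b) + 1))  # distances from the empty prefix of a\n",
    "    for i, ca in enumerate(a, start=1):\n",
    "        curr = [i]  # distance from a[:i] to the empty string\n",
    "        for j, cb in enumerate(b, start=1):\n",
    "            curr.append(min(\n",
    "                prev[j] + 1,               # delete ca\n",
    "                curr[j - 1] + 1,           # insert cb\n",
    "                prev[j - 1] + (ca != cb),  # substitute (no cost if equal)\n",
    "            ))\n",
    "        prev = curr\n",
    "    return prev[-1]\n",
    "\n",
    "print(edit_distance(\"cnada\", \"canada\"))  # one insertion\n",
    "print(edit_distance(\"cxnda\", \"canada\"))  # one substitution and one insertion"
   ]
  },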
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `fuzzy_dist=1`\n",
    "\n",
    "Countries at most one edit away from matching a regex are successfully cleaned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "    \"country\": [\n",
    "        \"canada\", \"cnada\", \"australa\", \"xntarctica\", \"koreea\", \"cxnda\",\n",
    "        \"afghnitan\", \"country: cnada\", \"foo indnesia bar\"\n",
    "    ]\n",
    "})\n",
    "clean_country(df, \"country\", fuzzy_dist=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `fuzzy_dist=2`\n",
    "\n",
    "Countries at most two edits away from matching a regex are successfully cleaned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", fuzzy_dist=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. `inplace` parameter\n",
    "This just deletes the given column from the returned dataframe. \n",
    "A new column containing cleaned coordinates is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_country(df, \"country\", fuzzy_dist=2, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. `validate_country()`\n",
    "\n",
    "`validate_country()` returns True when the input is a valid country value otherwise it returns False. Valid types are the same as `clean_country()`. By default `strict=True`, as opposed to `clean_country()` which has `strict` set to False by default. The default `input_type` is \"auto\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_country\n",
    "\n",
    "print(validate_country(\"switzerland\"))\n",
    "print(validate_country(\"country = united states\"))\n",
    "print(validate_country(\"country = united states\", strict=False))\n",
    "print(validate_country(\"ca\"))\n",
    "print(validate_country(800))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `validate_country()` on a pandas series\n",
    "\n",
    "Since `strict=True` by default, the inputs \"foo canada bar\", \"congo, kinshasa\" and \"congo, brazzaville\" are invalid since they don't directly match a country in the \"name\" or \"official\" formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "    \"country\": [\n",
    "        \"Canada\", \"foo canada bar\", \"cnada\", \"northern ireland\", \" ireland \",\n",
    "        \"congo, kinshasa\", \"congo, brazzaville\", 304, \"233\", \" tr \", \"ARG\",\n",
    "        \"hello\", np.nan, \"NULL\"\n",
    "    ]\n",
    "})\n",
    "\n",
    "df[\"valid\"] = validate_country(df[\"country\"])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `strict=False`\n",
    "For \"name\" and \"official\" input types the input is searched for a regex match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"valid\"] = validate_country(df[\"country\"], strict=False)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specifying `input_format`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"valid\"] = validate_country(df[\"country\"], input_format=\"numeric\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Credit\n",
    "\n",
    "The country data and regular expressions used are based on the [country_converter](https://github.com/konstantinstadler/country_converter) project."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
